In this investigation, I wanted to look at which characteristics of a trip's duration could be used to predict its start time. The main focus was on the start date and the gender of users.
This data set includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay area.
Trip durations in the dataset take on a very large range of values, from about 0 at the lowest to about 82,000 at the highest. Plotted on a logarithmic scale, the distribution of trip durations takes on a right-skewed shape.
The duration variable took on a large range of values, so I looked at the data using a log transform.
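As a rough illustration of the log transform described above, the sketch below counts durations in bins of equal width on a log10 scale. The sample values, the function name `log_bin_counts`, and the bin width of 0.5 are assumptions made for illustration; they are not taken from the original analysis.

```python
import math
from collections import Counter

# Hypothetical sample of trip durations; the real values come from the dataset.
durations = [61, 120, 350, 600, 900, 1800, 5400, 82000]


def log_bin_counts(values, bin_width=0.5):
    """Count values in bins of equal width on a log10 scale.

    Each key in the result is the lower edge of a bin, expressed in
    log10 units (so 2.5 covers values from 10**2.5 up to 10**3.0).
    """
    counts = Counter()
    for v in values:
        if v <= 0:
            continue  # log10 is undefined for non-positive values
        lower_edge = math.floor(math.log10(v) / bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


log_counts = log_bin_counts(durations)
for edge, n in log_counts.items():
    print(f"10^{edge:.1f} to 10^{edge + 0.5:.1f}: {n}")
```

Binning on the log scale keeps the many short trips from being squeezed into a single bar while the few very long trips stretch the axis.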
We will take a look at how the day of the week affects the number of trips made in a day. The day-of-week and date columns had to be converted to strings so that I could generate the histogram.
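A minimal sketch of the day-of-week count, assuming start times are stored as strings in a `YYYY-MM-DD HH:MM:SS` layout. That format, the sample timestamps, and the helper name `trips_per_weekday` are assumptions; the original notebook may have derived the weekday differently.

```python
from collections import Counter
from datetime import datetime

# Hypothetical start timestamps; the format is an assumption.
start_times = [
    "2019-02-01 08:15:00",
    "2019-02-01 17:40:00",
    "2019-02-02 11:05:00",
    "2019-02-04 09:30:00",
]

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def trips_per_weekday(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the number of trips per weekday name, in calendar order."""
    counts = Counter(
        # weekday() returns 0 for Monday through 6 for Sunday
        DAY_ORDER[datetime.strptime(ts, fmt).weekday()]
        for ts in timestamps
    )
    # Include days with zero trips so every bar appears in a plot
    return {day: counts.get(day, 0) for day in DAY_ORDER}


weekday_counts = trips_per_weekday(start_times)
print(weekday_counts)
```

Mapping the weekday index to a name through a fixed list keeps the days in calendar order rather than alphabetical order, which is usually what a bar chart of weekdays needs.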
There's a strong positive relationship between end_station_latitude and start_station_latitude, and also between end_station_longitude and start_station_longitude. On the other hand, we can also see a strong negative relationship between start_station_longitude and start_station_latitude, between end_station_longitude and start_station_latitude, and between end_station_longitude and end_station_latitude.
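The relationships described above can be quantified with the Pearson correlation coefficient. The sketch below is a plain-Python version; the coordinate values are made up for illustration, and the original analysis most likely relied on a library routine rather than this helper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical station coordinates, invented for this example.
start_station_latitude = [37.77, 37.78, 37.80, 37.33, 37.34]
end_station_latitude = [37.78, 37.77, 37.79, 37.34, 37.33]
start_station_longitude = [-122.42, -122.40, -122.41, -121.89, -121.90]

print(pearson(start_station_latitude, end_station_latitude))
print(pearson(start_station_latitude, start_station_longitude))
```

A value near 1 indicates a strong positive relationship and a value near -1 a strong negative one, which matches how the pairs are described in the text.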
We can see that there are no significant differences in these plots. It is possible that users are a bit younger on Tuesdays and Fridays than usual, as the lower edge of the IQR extends further down than on the other days of the week.
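To compare the box plots numerically rather than by eye, one option is to compute the quartiles per day of the week. The sketch below assumes each record is a (day, birth year) pair; the records, the function name `quartiles_by_day`, and the choice of the default quantile method in the `statistics` module are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical (day of week, birth year) records.
records = [
    ("Tuesday", 1980), ("Tuesday", 1985), ("Tuesday", 1990),
    ("Tuesday", 1995), ("Tuesday", 2000),
    ("Monday", 1975), ("Monday", 1982), ("Monday", 1988), ("Monday", 1991),
]


def quartiles_by_day(rows):
    """Return {day: (q1, median, q3)} for every day with at least 2 values."""
    grouped = defaultdict(list)
    for day, value in rows:
        grouped[day].append(value)
    result = {}
    for day, values in grouped.items():
        if len(values) < 2:
            continue  # quantiles() needs at least two data points
        q1, median, q3 = quantiles(values, n=4)
        result[day] = (q1, median, q3)
    return result


for day, (q1, median, q3) in quartiles_by_day(records).items():
    print(f"{day}: Q1={q1}, median={median}, Q3={q3}, IQR={q3 - q1}")
```

Comparing Q1 across days gives a direct check on whether the lower edge of the box really sits lower on some days than on others.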
Birth year had a surprisingly high correlation with the duration of the ride. An approximately exponential relationship was observed when duration was plotted. Box plots tell us that there aren't huge differences across user gender or day of the week. There was also an interesting relationship observed between start_time and end_time.
Figure: Correlation on Birth Year and Duration
I extended my investigation of start time against duration in this section by looking at the impact of the three categorical features. The multivariate exploration showed that there are more observations for younger users (later birth years), but in the second plot it is hard to see any relationship.
Looking at the point plots, it doesn't appear that the three categorical features have a systematic interaction effect. The features, on the other hand, aren't completely independent.
Figure: Correlation Between Duration, Age and Startdate
Let's see how the days of the week are related to duration and start time. It's fascinating to see how the start time plot for duration relates to the days of the week.